Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML API: Add normalization functions. #7331

Closed
wants to merge 22 commits into from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Sep 11, 2024

Trac ticket: Core-62036.

See also dmsnell#20
See also dmsnell#21

The HTML Processor understands HTML regardless of how it's written, but
many other functions are unable to do so. There are all sorts of syntax
peculiarities and semantics that would be helpful to eliminate using the
knowledge contained in the HTML Processor.

This patch introduces WP_HTML_Processor::normalize( $html ) as a
method which takes a fragment of HTML as input and then returns a
serialized version of the input, "cleaning it up" by balancing all
tags, providing all missing optional tags, re-encoding all text,
removing all duplicate attributes, and double-quote-escaping all
attribute values.

php > var_dump( WP_HTML_Processor::normalize('<a href=#anchor v=5 href="/" enabled>One</a another v=5><!--') );
string(39) "<a href="#anchor" v="5" enabled>One</a>"

php > var_dump( WP_HTML_Processor::normalize( '<![CDATA[invalid comment]]> syntax < <> "oddities"' ) );
string(64) "<!--[CDATA[invalid comment]]--> syntax &lt; &lt;&gt; &quot;oddities&quot;"

php > var_dump( WP_HTML_Processor::normalize( '<textarea>use a &lt;/textarea></textarea>' ) );
string(44) "<textarea>use a &lt;/textarea&gt;</textarea>"

php > var_dump( ( WP_HTML_Processor::create_full_parser( '<p>Test<p>again' ) )->serialize() );
string(62) "<html><head></head><body><p>Test</p><p>again</p></body></html>"

The HTML Processor understands HTML regardless of how it's written, but
many other functions are unable to do so. There are all sorts of syntax
peculiarities and semantics that would be helpful to eliminate using the
knowledge contained in the HTML Processor.

This patch introduces `WP_HTML_Processor::normalize( $html )` as a
method which takes a fragment of HTML as input and then returns a
serialized version of the input, "cleaning it up" by balancing all
tags, providing all missing optional tags, re-encoding all text,
removing all duplicate attributes, and double-quote-escaping all
attribute values.

Core-62036
Copy link

github-actions bot commented Sep 11, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, westonruter, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice: force_balance_tags(): The Next Generation

src/wp-includes/html-api/class-wp-html-processor.php Outdated Show resolved Hide resolved
src/wp-includes/html-api/class-wp-html-processor.php Outdated Show resolved Hide resolved
*/
public function serialize(): ?string {
if ( WP_HTML_Tag_Processor::STATE_READY !== $this->parser_state ) {
return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this make sense to throw an exception?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to avoid throwing exceptions in use code. Tell me more about the value of potentially crashing vs. returning null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I guess to give more information about why it is returning null. Maybe _doing_it_wrong() then would be better? It would be helpful to get feedback in code for what is documented:

	 * This differs from {@see WP_HTML_Processor::normalize} in that it starts with
	 * a specific HTML Processor, which _must_ not have already started scanning;
	 * it must be in the initial ready state and will be in the completed state once
	 * serialization is complete.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah okay I see now. another thought I had was resetting to the beginning, parsing, and returning to the previous location, which involves double-parsing if already mid-way through a document.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've called wp_trigger_error() in these cases.

while ( $this->next_token() ) {
$token_type = $this->get_token_type();

switch ( $token_type ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about processing instructions? Shouldn't they get a special treatment?

For example, <html><body><?php foo(); ?> is interpreted as:

image

Seems like it should get serialized back in the same way? Maybe not since the browser serializes this as <!--?php foo(); ?-->. But maybe that should be an option?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah good catch: it should also serialize the PI node tag name, which would match what you wrote. looks like this needs a review of all of the invalid comment syntax

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should all be updated now. if something lingers I'd like to fix it, but ultimately if we botch an invalid comment, I'm guessing it's not the end of the world.

these will go into test cases.

src/wp-includes/html-api/class-wp-html-processor.php Outdated Show resolved Hide resolved
}

if ( ! $in_html && $this->has_self_closing_flag() ) {
$html .= '/';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While not required, it seems a space is usually added here in the wild, right? (e.g. Prettier does this)

Suggested change
$html .= '/';
$html .= ' /';

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a good thought. with double-quoted attributes it's not relevant, but with unquoted attribute values it becomes relevant. we don't need that since we control quoting. maybe it's best to add it in anyway for the same of other tools.

Comment on lines 1199 to 1201
if ( null !== $this->get_last_error() ) {
return null;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, it would be helpful to know why it returned null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a tougher question because it would conflate the basic ?string return value. practically I think this can only occur if the HTML is unsupported (in which case we really shouldn't return any processed string) or we've run out of bookmarks (which should be unrealistically rare - and that reminds me, I found 2500 bookmarks sufficient to parse everything in my set of ~300k HTML documents, and I intend on upping the default value to support that for 6.7).

suppose you know why this failed: what would you do in response?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could also use _doing_it_wrong() here too to communicate that information, I suppose. Or rather, wp_trigger_error() would be the more relevant function. If I knew why it failed then I wouldn't have to figure out why it failed. True it probably wouldn't impact the result in the end, but for debugging it would be useful.

Looking at where last_error is set, it seems to always coincide with throwing an exception anyway. So in practice would this if statement ever be entered?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the exceptions thrown internally are caught and shut down parsing, but does not crash. unsupported content exceptions are returned via get_unsupported_exception()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've called wp_trigger_error() in these cases.

dmsnell and others added 7 commits September 11, 2024 11:11
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
If code later in the processing pipeline adds unquoted attributes
and doesn't add the requisite space following that, then another
parser might find that the solidus is part of the attribute value
instead of serving as a self-closing flag.

Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty exciting, I'd like to start adding tests for it.

I just added it to the html api debugger when supported.

I'd love to start adding tests for this. One good test will be idempotency, where after an initial normalization, subsequent normalizations will be identical.

This mentions null bytes here specifically:

Text will be re-encoded, null bytes handled, and invalid UTF-8 replaced with U+FFFD.

I think that's working correctly in text. Should it also be handled in tag names, attribute names, and attribute values?

Input (null bytes replaces for clarity)

<div␀-nb nb-att-␀-="nb-val-␀-">

Normalized output:

<div␀-nb nb-att-␀-="nb-val-␀-"></div␀-nb>

Expected:

<div�-nb nb-att-�-="nb-val-�-"></div�-nb>

src/wp-includes/html-api/class-wp-html-processor.php Outdated Show resolved Hide resolved
@sirreal
Copy link
Member

sirreal commented Sep 12, 2024

There are some known issues from HTML5lib tests similar to the PI problems mentioned here: #7331 (comment)

'comments01/line0155' => 'Unimplemented: Need to access raw comment text on non-normative comments.',
'comments01/line0169' => 'Unimplemented: Need to access raw comment text on non-normative comments.',
'html5test-com/line0129' => 'Unimplemented: Need to access raw comment text on non-normative comments.',

There's no good way to read the comment under some circumstances and something like a get_raw_comment_content() method would be helpful.



<?import namespace="foo" implementation="#bar">

Each of these does not satisfy the PI constraint (missing ? before the > closer) so they're treated as invalid HTML comments. The initial ? isn't accessible through get_modifiable_text(), modifying that character could change the token to something completely different.

There are a couple of cases like this, I think they're all <? or <!-started strings triggering the bogus comment state.

CORRECTION:

I've edited it, it was initially incorrect. The bogus comments starting with <! do ignore the ! in their contents. Only <? seem to be mishandled.

Input Expected Actual Correct
<?xml foo > <!--?xml foo --> <!--xml foo --> ⛔️
<!> <!----> <!---->
<! more stuff > <!-- more stuff --> <!-- more stuff -->

@sirreal
Copy link
Member

sirreal commented Sep 12, 2024

We'd talked about a method to really inspect different types of comment text content. I've proposed a method in #7342. That would be helpful here.

Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is nice and it seems like it's in a good place. I'm happy to see it getting tests.

I'd like it if null bytes were normalized in more places (tag names, attribute names and values) before this lands.

src/wp-includes/html-api/class-wp-html-processor.php Outdated Show resolved Hide resolved
pento pushed a commit that referenced this pull request Sep 20, 2024
HTML often appears in ways that are unexpected. It may be missing implicit tags, may have unquoted, single-quoted, or double-quoted attributes, may contain duplicate attributes, may contain unescaped text content, or any number of other possible invalid constructions. The HTML API understands all fo these inputs, but downline parsers may not, and HTML snippets which are safe on their own may introduce problems when joined with other HTML snippets.

This patch introduces the `serialize()` method on the HTML Processor, which prints a fully-normative HTML output, eliminating invalid markup along the way. It produces a string which contains every missing tag, double-quoted attributes, and no duplicates. A `normalize()` static method on the HTML Processor provides a convenient wrapper for constructing a fragment parser and immediately serializing.

Subclasses relying on the `serialize_token()` method may perform structural HTML modifications with as much security as the upcoming `\Dom\HTMLDocument()` parser will, though these are not
able to provide the full safety that will eventually appear with `set_inner_html()`.

Further work may explore serializing to XML (which involves a number of other important transformations) and adding constraints to serialization (such as only allowing inline/flow/formatting elements and text).

Developed in #7331
Discussed in https://core.trac.wordpress.org/ticket/62036

Props dmsnell, jonsurrell, westonruter.
Fixes #62036.


git-svn-id: https://develop.svn.wordpress.org/trunk@59076 602fd350-edb4-49c9-b593-d223f7449a82
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Sep 20, 2024
HTML often appears in ways that are unexpected. It may be missing implicit tags, may have unquoted, single-quoted, or double-quoted attributes, may contain duplicate attributes, may contain unescaped text content, or any number of other possible invalid constructions. The HTML API understands all fo these inputs, but downline parsers may not, and HTML snippets which are safe on their own may introduce problems when joined with other HTML snippets.

This patch introduces the `serialize()` method on the HTML Processor, which prints a fully-normative HTML output, eliminating invalid markup along the way. It produces a string which contains every missing tag, double-quoted attributes, and no duplicates. A `normalize()` static method on the HTML Processor provides a convenient wrapper for constructing a fragment parser and immediately serializing.

Subclasses relying on the `serialize_token()` method may perform structural HTML modifications with as much security as the upcoming `\Dom\HTMLDocument()` parser will, though these are not
able to provide the full safety that will eventually appear with `set_inner_html()`.

Further work may explore serializing to XML (which involves a number of other important transformations) and adding constraints to serialization (such as only allowing inline/flow/formatting elements and text).

Developed in WordPress/wordpress-develop#7331
Discussed in https://core.trac.wordpress.org/ticket/62036

Props dmsnell, jonsurrell, westonruter.
Fixes #62036.

Built from https://develop.svn.wordpress.org/trunk@59076


git-svn-id: http://core.svn.wordpress.org/trunk@58472 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Sep 20, 2024
HTML often appears in ways that are unexpected. It may be missing implicit tags, may have unquoted, single-quoted, or double-quoted attributes, may contain duplicate attributes, may contain unescaped text content, or any number of other possible invalid constructions. The HTML API understands all fo these inputs, but downline parsers may not, and HTML snippets which are safe on their own may introduce problems when joined with other HTML snippets.

This patch introduces the `serialize()` method on the HTML Processor, which prints a fully-normative HTML output, eliminating invalid markup along the way. It produces a string which contains every missing tag, double-quoted attributes, and no duplicates. A `normalize()` static method on the HTML Processor provides a convenient wrapper for constructing a fragment parser and immediately serializing.

Subclasses relying on the `serialize_token()` method may perform structural HTML modifications with as much security as the upcoming `\Dom\HTMLDocument()` parser will, though these are not
able to provide the full safety that will eventually appear with `set_inner_html()`.

Further work may explore serializing to XML (which involves a number of other important transformations) and adding constraints to serialization (such as only allowing inline/flow/formatting elements and text).

Developed in WordPress/wordpress-develop#7331
Discussed in https://core.trac.wordpress.org/ticket/62036

Props dmsnell, jonsurrell, westonruter.
Fixes #62036.

Built from https://develop.svn.wordpress.org/trunk@59076


git-svn-id: https://core.svn.wordpress.org/trunk@58472 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell
Copy link
Member Author

dmsnell commented Sep 20, 2024

Merged in [59076]
03b12dc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants